Purpose¶

analysis of variance.¶

In [54]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from scipy import stats as st

Instructions¶

Problem 1¶

Our company is working on the dosing regimen for a new therapeutic compound. The compound will be administered in a tablet with a time-release coating. Both the thickness of the time-release coating and the total mass of compound in the tablet will influence the plasma concentration profile of the drug over time. Furthermore, biological variation in kidney function creates variation in each sample.

We have conducted the first round of exploratory experiments examining the effect of both the mass of the compound per tablet and the thickness of the time release coating in mice. There were four levels of coating thickness and four levels of compound mass included in the experiment. Five mice were subjected to each combination of factor levels, for a total or 80 measurements. Blood samples were collected on all mice at 4 and 8 hours. As a junior employee your access to proprietary knowledge is limited, so the values of the levels are not supplied. Your task is to perform a basic exploratory data analysis with two-way ANOVA and write up the results to present to the rest of the team. The tablet thickness is labeled "t" and the mass of the compound is labeled "c", each level is denoted with an integer.

  1. Create a box and whisker plot for each factor at each time. (Hint: There will only be two plots for each time set. You do not need to separate out each combination of factor levels. Four box-plots total.)
  2. Perform a two-way analysis of variance to determine the significance of each factor at each point in time. In other words, perform the ANOVA on the data set at 4 hours and also the data set at 8 hours.
  3. Analyze the results of the ANOVA tests to determine if each factor has a statistically significant influence on the plasma concentration at that time.
  4. If the goal was to achieve the highest blood concentration at any time, which combination of factor levels would you choose?
  5. If the goal was to achieve the highest blood concentration at 8 hours, which combination of factor levels would you choose?
  6. If the goal was to achieve the most consistent blood concentration over time, which combination of factor levels would you choose?

The data sets are given below.

  • data 4.csv
  • data_8.csv

Your conclusions should be clearly stated in your Jupyter Notebook.

For a review of two-way ANOVA see this resource: http://www.real-statistics.com/two-way-anova/two-factor-anova-with-replication/_

In [2]:
df_4 = pd.read_csv('data_4.csv')
In [3]:
df_4.head()
Out[3]:
Unnamed: 0 t1 t2 t3 t4
0 c1 0.0208 0.3208 0.4208 0.7208
1 c1 0.0000 0.1436 0.2436 0.5436
2 c1 0.4664 0.7664 0.8664 1.1664
3 c1 0.0000 0.0000 0.0000 0.0000
4 c1 1.9642 2.2642 2.3642 2.6642
In [4]:
df_4 = pd.melt(df_4, id_vars=['Unnamed: 0'], var_name='Thickness')
df_4 = df_4.rename(columns={"Unnamed: 0": "Mass","value":'Plasma_Concentration'})
In [5]:
df_4.head()
Out[5]:
Mass Thickness Plasma_Concentration
0 c1 t1 0.0208
1 c1 t1 0.0000
2 c1 t1 0.4664
3 c1 t1 0.0000
4 c1 t1 1.9642
In [6]:
df_8 = pd.read_csv('data_8.csv')
In [7]:
df_8.head()
Out[7]:
Unnamed: 0 t1 t2 t3 t4
0 c1 0.9201 1.2201 1.3201 1.6201
1 c1 0.3800 0.6800 0.7800 1.0800
2 c1 0.3652 0.6652 0.7652 1.0652
3 c1 0.0000 0.0000 0.0018 0.3018
4 c1 1.4187 1.7187 1.8187 2.1187
In [8]:
df_8 = pd.melt(df_8, id_vars=['Unnamed: 0'], var_name='Thickness')
df_8 = df_8.rename(columns={"Unnamed: 0": "Mass","value":'Plasma_Concentration'})
In [9]:
df_8.head()
Out[9]:
Mass Thickness Plasma_Concentration
0 c1 t1 0.9201
1 c1 t1 0.3800
2 c1 t1 0.3652
3 c1 t1 0.0000
4 c1 t1 1.4187

Create a box and whisker plot for each factor at each time. (Hint: There will only be two plots for each time set. You do not need to separate out each combination of factor levels. Four box-plots total.)¶

In [10]:
df_4.columns
Out[10]:
Index(['Mass', 'Thickness', 'Plasma_Concentration'], dtype='object')
In [11]:
df_4.boxplot(column=['Plasma_Concentration'],by = 'Mass',figsize = (5,6))
plt.show()
In [12]:
df_4.boxplot(column=['Plasma_Concentration'],by = 'Thickness',figsize = (5,6))
plt.show()
In [13]:
df_8.boxplot(column=['Plasma_Concentration'],by = 'Mass',figsize = (5,6))
plt.show()
In [14]:
df_8.boxplot(column=['Plasma_Concentration'],by = 'Thickness',figsize = (5,6))
plt.show()

Perform a two-way analysis of variance to determine the significance of each factor at each point in time. In other words, perform the ANOVA on the data set at 4 hours and also the data set at 8 hours.¶

4 Hours¶
In [15]:
model = ols('Plasma_Concentration ~ C(Mass) + C(Thickness) + C(Mass):C(Thickness)', data=df_4).fit()
sm.stats.anova_lm(model, typ=2)
Out[15]:
sum_sq df F PR(>F)
C(Mass) 20.702632 3.0 6.368337 0.000761
C(Thickness) 4.075817 3.0 1.253762 0.297774
C(Mass):C(Thickness) 0.054839 9.0 0.005623 1.000000
Residual 69.351877 64.0 NaN NaN
8 Hours¶
In [16]:
model = ols('Plasma_Concentration ~ C(Mass) + C(Thickness)', data=df_8).fit()
sm.stats.anova_lm(model, typ=2)
Out[16]:
sum_sq df F PR(>F)
C(Mass) 3.670774 3.0 3.130997 0.030712
C(Thickness) 4.584936 3.0 3.910734 0.011984
Residual 28.528345 73.0 NaN NaN

Analyze the results of the ANOVA tests to determine if each factor has a statistically significant influence on the plasma concentration at that time.¶

We can see the following p-values for each of the factors in the table:

4 Hours

  • C(Mass): p-value = 0.000761
  • C(Thickness): p-value = 0.297774

Since the p-value for C(Mass) and is smaller than .05, this means that Mass has a statistically significant influence on the plasma concentration at 4 hours.

Since the p-value for C(Thickness) is greater than .05, this means that Thickness has no statistically significant influence on the plasma concentration at 4 hours.

8 Hours

  • C(Mass): p-value = 0.050106
  • C(Thickness): p-value = 0.022128

Since the p-values for C(Thickness) is smaller than .05, this means that Thickness has a statistically significant influence on the plasma concentration at 8 hours.

Since the p-values for C(Mass) is greater than .05, this means that Mass has no statistically significant influence on the plasma concentration at 8 hours.

If the goal was to achieve the highest blood concentration at any time, which combination of factor levels would you choose?¶

Ans: C4 and T4

If the goal was to achieve the highest blood concentration at 8 hours, which combination of factor levels would you choose?¶

Ans: C4 and T4

If the goal was to achieve the most consistent blood concentration over time, which combination of factor levels would you choose?¶

Ans: C3 and T3

Problem 2:¶

  1. Import the file "Biocharcolas.xlsx" you used for the Midterm Project into a Jupyter notebook. Plot the treatment means and standard deviations for the "Grav" variable
    • Note: most boxplots plot quartiles which are NOT means and standard deviations. We recommend using something like the df.plot function in Pandas (described here)
    • Feel free to copy relevant set-up code from what you already developed for the Midterm!
  1. Perform a I-way ANOVA for the factor "Soil" (HINT: check out the df.dropna() function)

    • A. Generate your hypotheses
    • B. Calculate the 1-way ANOVA p-value
    • C. In a markdown cell, interpret your findings including what the p-value from an ANOVA can tell you and any conclusions you can make about the factor you measured. With only 2 groups to compare, was a 1-way ANOVA the right test to perform?
  1. Perform a I-way ANOVA for the factor "Biochar"
    • A. Generate your hypotheses
    • B. Calculate the I-way ANOVA p-value
    • C. In a markdown cell, interpret your findings including any conclusions you can make about the factor you measured.
  1. Perform a 2-way ANOVA for the factors "Soil" and "Biochar" with the interaction

    • A. Generate your hypotheses
    • B. Calculate the 2-way ANOVA p-values
    • C. In a markdown cell, interpret your findings including any conclusions you can make about the factors and interaction you measured.
  2. Perform a Tukey's HSD posthoc test on your 2-way ANOVA (Hint: first use the df.dropna() function again, but only using the "Grav" column)

    • A. Generate the pair-wise comparison table for a Tukey's HSD posthoc test (from statsmodels.stats.multicomp import pairwise_tukeyhsd — described here)
    • B. In a markdown cell, give an overall interpretation the findings of your posthoc test (please don't describe every pair-wise comparison!). Are there any trends? Does this visually corroborate with your graph from question 2? Does this reinforce your findings from both 1-way ANOVAs?
    • C. In a markdown cell, describe whether every pair-wise comparison tells you something useful?

1. Import the file "Biocharcolas.xlsx" you used for the Midterm Project into a Jupyter notebook. Plot the treatment means and standard deviations for the "Grav" variable¶

  • Note: most boxplots plot quartiles which are NOT means and standard deviations. We recommend using something like the df.plot function in Pandas (described here)
  • Feel free to copy relevant set-up code from what you already developed for the Midterm!
In [17]:
df = pd.read_excel('Biocharcoals.xlsx')
In [18]:
df.head()
Out[18]:
Soil Biochar Replicate Treatment day Temp DeltaCO2 FractionCO2 Grav Nitrate
0 Chehalis 350 1 Chehalis.350 0 NaN NaN NaN 0.427248 5.079171
1 Chehalis 350 1 Chehalis.350 1 24.30 0.005872 0.707971 0.420356 11.092042
2 Chehalis 350 1 Chehalis.350 2 24.60 0.006985 0.642727 0.419754 10.118050
3 Chehalis 350 1 Chehalis.350 3 25.15 0.004096 0.458804 0.422973 10.903952
4 Chehalis 350 1 Chehalis.350 4 25.20 0.003023 0.356825 0.424141 12.643656
In [19]:
df = df.dropna()
df.head()
Out[19]:
Soil Biochar Replicate Treatment day Temp DeltaCO2 FractionCO2 Grav Nitrate
1 Chehalis 350 1 Chehalis.350 1 24.30 0.005872 0.707971 0.420356 11.092042
2 Chehalis 350 1 Chehalis.350 2 24.60 0.006985 0.642727 0.419754 10.118050
3 Chehalis 350 1 Chehalis.350 3 25.15 0.004096 0.458804 0.422973 10.903952
4 Chehalis 350 1 Chehalis.350 4 25.20 0.003023 0.356825 0.424141 12.643656
5 Chehalis 350 1 Chehalis.350 7 25.06 0.005722 0.262590 0.421858 21.088875
In [20]:
df_grouped = df.groupby('Treatment', as_index = False)['Grav'].agg({'mean','std'},axis = 0).reset_index(level=[0])
In [21]:
df_grouped
Out[21]:
Treatment mean std
0 Chehalis.0 0.396191 0.025297
1 Chehalis.350 0.431703 0.029243
2 Chehalis.500 0.443616 0.026858
3 Chehalis.700 0.438545 0.060386
4 Willamette.0 0.347809 0.046381
5 Willamette.350 0.348051 0.021975
6 Willamette.500 0.350081 0.020354
7 Willamette.700 0.349836 0.020466
In [22]:
df_grouped.set_index('Treatment').plot.bar(color = ['b','r'],rot=45)
plt.title("Mean and Standard Deviation of Grav by Treatment")
plt.ylabel('Values')
plt.show()

2. Perform a 1-way ANOVA for the factor "Soil" (HINT: check out the df.dropna() function)¶

  • A. Generate your hypotheses
  • B. Calculate the 1-way ANOVA p-value
  • C. In a markdown cell, interpret your findings including what the p-value from an ANOVA can tell you and any conclusions you can make about the factor you measured. With only 2 groups to compare, was a 1-way ANOVA the right test to perform?
  • The null hypothesis (H0): There is no difference among group means of Grav in soil.
  • The alternate hypothesis (Ha): Atleast one group differs significantly from the overall mean of Grav in soil.
In [23]:
df['Soil'].value_counts()
Out[23]:
Chehalis      108
Willamette    108
Name: Soil, dtype: int64
In [24]:
Chehalis_idx = df["Soil"] == "Chehalis"
Willamette_idx = df["Soil"] == "Willamette"
In [25]:
Chehalis_Grav = df[Chehalis_idx]["Grav"]
Willamette_Grav = df[Willamette_idx]["Grav"]
In [28]:
stat, pval = st.f_oneway(Chehalis_Grav,Willamette_Grav)
print('p-values',pval)
if pval < 0.05:    # alpha value is 0.05 or 5%
   print("Null Hypothesis is Rejected")
else:
  print("Failed to Reject Null Hypothesis")
p-values 2.6477869219833055e-38
Null Hypothesis is Rejected
  • This means we do have sufficient evidence to say that there is a statistically significant difference between the mean Grav of the two groups.

  • We can perform one-way ANOVA with two groups (Ideally used for 3 or more groups). T test is generally the more go to approach for two groups

Perform a 1-way ANOVA for the factor "Biochar"¶

  • A. Generate your hypotheses
  • B. Calculate the I-way ANOVA p-value
  • C. In a markdown cell, interpret your findings including any conclusions you can make about the factor you measured.
  • The null hypothesis (H0): There is no difference among group means of Grav in Biochar.
  • The alternate hypothesis (Ha): Atleast one group differs significantly from the overall mean of Grav in Biochar.
In [29]:
df['Biochar'].value_counts()
Out[29]:
350     54
500     54
700     54
None    54
Name: Biochar, dtype: int64
In [47]:
idx_350 = df["Biochar"] == 350
idx_500 = df["Biochar"] == 500
idx_700 = df["Biochar"] == 700
idx_None = df["Biochar"] == 'None'
In [48]:
Grav_350 = df[idx_350]["Grav"]
Grav_500 = df[idx_500]["Grav"]
Grav_700 = df[idx_700]["Grav"]
Grav_None = df[idx_None]["Grav"]
In [49]:
stat, pval = st.f_oneway(Grav_350,Grav_500,Grav_700,Grav_None)
print('p-values',pval)
if pval < 0.05:    # alpha value is 0.05 or 5%
   print("Null Hypothesis is Rejected")
else:
  print("Failed to Reject Null Hypothesis")
p-values 0.06719121812784043
Failed to Reject Null Hypothesis
  • This means we don’t have sufficient evidence to say that there is a statistically significant difference between the mean Grav of the four groups.

Perform a 2-way ANOVA for the factors "Soil" and "Biochar" with the interaction¶

  • A. Generate your hypotheses
  • B. Calculate the 2-way ANOVA p-values
  • C. In a markdown cell, interpret your findings including any conclusions you can make about the factors and interaction you measured.
  • H0: Soil and Biochar exposure have no significant effect on Grav
  • Ha: Soil and Biochar exposure have a significant effect on Grav
In [51]:
model = ols('Grav ~ C(Soil) + C(Biochar) + C(Soil):C(Biochar)', data=df).fit()
In [52]:
sm.stats.anova_lm(model, typ=2)
Out[52]:
sum_sq df F PR(>F)
C(Soil) 0.333354 1.0 285.829386 6.400521e-41
C(Biochar) 0.020300 3.0 5.801913 7.934483e-04
C(Soil):C(Biochar) 0.017062 3.0 4.876608 2.680177e-03
Residual 0.242584 208.0 NaN NaN
  • Since Soil and Biochar have a value less than 0.05, This means we do have sufficient evidence to say Soil and Biochar exposure have a significant effect on Grav
  • We can see that we do get a significant p-value for the interaction, so we can conclude that there is likely to be an additional effect on Grav when we change the level of both factors.

Perform a Tukey's HSD posthoc test on your 2-way ANOVA (Hint: first use the df.dropna() function again, but only using the "Grav" column)¶

  • A. Generate the pair-wise comparison table for a Tukey's HSD posthoc test (from statsmodels.stats.multicomp import pairwise_tukeyhsd — described here)
  • B. In a markdown cell, give an overall interpretation the findings of your posthoc test (please don't describe every pair-wise comparison!). Are there any trends? Does this visually corroborate with your graph from question 2? Does this reinforce your findings from both 1-way ANOVAs?
  • C. In a markdown cell, describe whether every pair-wise comparison tells you something useful?
In [55]:
tukey = pairwise_tukeyhsd(endog=df['Grav'],groups=df['Soil'],alpha=0.05)

print(tukey)
   Multiple Comparison of Means - Tukey HSD, FWER=0.05   
=========================================================
 group1    group2   meandiff p-adj  lower   upper  reject
---------------------------------------------------------
Chehalis Willamette  -0.0786 0.001 -0.0883 -0.0689   True
---------------------------------------------------------

The 95% confidence interval for the mean difference between group Chehalis and group Willamette is (-0.0883, -0.0689), and since this interval does not contain zero we know that the difference between these two group means is statistically significant.

Likewise, the p-value for the mean difference between group Chehalis and group Willamette is 0.001, which is smaller than our significance level of 0.05, so this also indicates that the difference between these two group means is statistically significant.

In [68]:
biochar = df['Biochar'].astype(str)
In [70]:
tukey = pairwise_tukeyhsd(endog=df['Grav'],groups=biochar,alpha=0.05)

print(tukey)
Multiple Comparison of Means - Tukey HSD, FWER=0.05
===================================================
group1 group2 meandiff p-adj   lower  upper  reject
---------------------------------------------------
   350    500    0.007    0.9 -0.0194 0.0333  False
   350    700   0.0043    0.9  -0.022 0.0307  False
   350   None  -0.0179 0.2977 -0.0442 0.0085  False
   500    700  -0.0027    0.9  -0.029 0.0237  False
   500   None  -0.0248 0.0727 -0.0512 0.0015  False
   700   None  -0.0222  0.132 -0.0485 0.0042  False
---------------------------------------------------
  • The 95% confidence interval for the mean difference between group 350 and group 500 is (-0.0194, 0.0333), and since this interval does contain zero we know that the difference between these two group means is not statistically significant.

  • Likewise, the p-value for the mean difference between group 350 and group 500 is 0.9, which is greater than our significance level of 0.05, so this also indicates that the difference between these two group means is not statistically significant.

  • If the interval contains zero, then we know that the difference in group means is not statistically significant. In the example above, All pairwise comparisons are not statistically significant.

  • No it does not corrobrate visually with graph in Question 2

  • Yes it reinforces our findings in One-way ANOVA performed above

  • No not every pairwise comparison does not tell us something useful but some of them definitely helps us find the statistical signifance between groups